28 July 2017

S0: Prior to lecture

Preparing for this lecture

  • Please run this code on your laptop:
## Might be a while...
install.packages(c("ggplot2","dplyr","tidyr","janitor","plotly",
                   "devtools","learnr","gapminder")) 

library(devtools)
install_github("kevinwang09/2017_STAT3914", subdir = "learnr3914")
  • Familiarise yourself with the iris dataset. Typing iris into the R console should load this data. Pay attention to its column names, row names and the structure of each column.

S1: The necessity of applied statistics

Good statistical discoveries don't fall out of the sky

  • Statisticians are great at many things:

    1. Understanding data characteristics
    2. Building statistical/mathematical models
    3. Repeat 1 and 2…like…a lot…
    4. Extract insights
  • But the mother of them all, preparing the data, is not trivial. (e.g. STAT2xxx lab exams)

Let \(\boldsymbol{X}\) be the thing I want…

  • The real problem is not applying fancy shampoo to your cat. It is getting your cat into the bathtub.

Hidden side of being a statistician

  • Assume we have data
  • Assume we have questions, which the data can be used to answer
  • Assume we have cleaned data
  • Assume we interrogated the right aspects of the data using appropriate statistics
  • Assuming we did everything right, we still need to communicate the insights to others

Summary of this lecture

  • Powerful tools for data preparation.
  • Passive learning is not going to work.
  • Our goal: clean the dirtyIris data to be exactly the same as the original iris data.
  • S1: Introduction
  • S2: Reading in data using readr and readxl
  • S3: Basic data cleaning using janitor
  • S4: Clean coding using magrittr
  • S5: Data filtering using dplyr
  • S6: Data visualisation using ggplot2
  • S7: Conclusion

S2: Reading data

Better read/write data

  • base R functions are not sufficient for modern uses.

  • readr functions are superior in data import warnings, column type handling, speed, scalability and consistency.

library(readr)
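As a quick sketch of the difference (the file contents below are made up), here is the same csv imported both ways:

```r
library(readr)

## A made-up csv written to a temporary file, imported both ways.
tmp = tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)

base_version  = read.csv(tmp)    ## base R: a plain data.frame
readr_version = read_csv(tmp)    ## readr: a tibble, with a column specification

class(base_version)
class(readr_version)
```

read_csv() also reports how each column was parsed, which makes silent type conversions much harder to miss.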

Reading data using readr (1)

dirtyIris = readr::read_csv("dirtyIris.csv")
Parsed with column specification:
cols(
  SepAl....LeNgth = col_double(),
  `Sepal.?    Width` = col_double(),
  `petal.Length(*&^` = col_double(),
  `petal.$#^&Width` = col_double(),
  `SPECIES^` = col_character(),
  allEmpty = col_character()
)
class(dirtyIris) ## `tibble` is a `data.frame` with better formatting.
[1] "tbl_df"     "tbl"        "data.frame"
  • readxl and haven (for SAS, SPSS etc) packages work similarly.

Reading data using readr (2)

dirtyIris
# A tibble: 250 x 6
  SepAl....LeNgth `Sepal.?    Width` `petal.Length(*&^` `petal.$#^&Width`
            <dbl>              <dbl>              <dbl>             <dbl>
1      5.80000000                2.6                4.0               1.2
2      5.80000000                2.7                5.1               1.9
3      0.01874617                 NA                 NA                NA
4              NA                 NA                 NA                NA
5              NA                 NA                 NA                NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
#   allEmpty <chr>
  • We now proceed to data cleaning on the dirtyIris dataset.

Too trivial? Here is a short homework exercise

Here is a dataset. Click here.

  1. Write two sentences about what a .gmt file is and who publishes this format.
  2. Which packages can read in .gmt files?
  3. How do you install such a package?
  4. What class is this data once read into R? Is it a data.frame?
  5. The data contains 50 different gene-sets. What is the size of each gene-set?
  6. What are the six most frequently mentioned genes?

S3: Clean data

What is clean data?

Clean data is a data set that allows you to do statistical modelling without extra processing

  1. Good documentation on the entire data.
  2. Each column is a variable. The name should be informative, and:
    • No bad characters/formatting @KevinWang009
    • No inconsistent capitalisation or separators (Cricket_australia vs cricket.Australia)
  3. Each row is an observation:
    • No bad characters
    • No poorly designed row names (3, 2, 5, … )
    • No repeated row names (a, a.1, b, b.1, … )
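As a toy illustration (the names and values below are made up), here is a data.frame that breaks rule 2 next to one that follows it:

```r
## Made-up example: column names breaking rule 2...
bad = data.frame(
  "SepAl....LeNgth"  = c(5.1, 4.9),   ## inconsistent capitalisation
  "Sepal Width (cm)" = c(3.5, 3.0),   ## bad characters in the name
  check.names = FALSE
)

## ...versus informative, consistently formatted names.
good = data.frame(
  sepal_length = c(5.1, 4.9),
  sepal_width  = c(3.5, 3.0)
)

names(bad)
names(good)
```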

Data cleaning in R

  • Clean data is a well-designed data.frame.

  • Column type (esp. dates and factors) handling was the primary reason we used readr instead of base R when importing data.

  • Our goal: clean the dirtyIris data to be exactly the same as the original iris data.

    • Basic data cleaning using janitor package.
    • More advanced data manipulation through dplyr.

janitor: basic data cleaning

  • Clean up the bad column names
library(janitor)
glimpse(dirtyIris)
Observations: 250
Variables: 6
$ SepAl....LeNgth  <dbl> 5.80000000, 5.80000000, 0.01874617, NA, NA, 6...
$ Sepal.?    Width <dbl> 2.6, 2.7, NA, NA, NA, 3.0, NA, 2.9, NA, NA, 3...
$ petal.Length(*&^ <dbl> 4.0, 5.1, NA, NA, NA, 5.5, NA, 5.6, NA, NA, 1...
$ petal.$#^&Width  <dbl> 1.2, 1.9, NA, NA, NA, 1.8, NA, 1.8, NA, NA, 0...
$ SPECIES^         <chr> "versicolor", "virginica", NA, NA, NA, "virgi...
$ allEmpty         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## Clean up column names
better = clean_names(dirtyIris) 
glimpse(better)
Observations: 250
Variables: 6
$ sepal_length <dbl> 5.80000000, 5.80000000, 0.01874617, NA, NA, 6.500...
$ sepal_width  <dbl> 2.6, 2.7, NA, NA, NA, 3.0, NA, 2.9, NA, NA, 3.8, ...
$ petal_length <dbl> 4.0, 5.1, NA, NA, NA, 5.5, NA, 5.6, NA, NA, 1.9, ...
$ petal_width  <dbl> 1.2, 1.9, NA, NA, NA, 1.8, NA, 1.8, NA, NA, 0.4, ...
$ species      <chr> "versicolor", "virginica", NA, NA, NA, "virginica...
$ allempty     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

janitor: removal of missing values

  • Purely empty rows/columns are non-informative.
## Removing empty rows/columns
evenBetter = remove_empty_rows(better)
evenBetter = remove_empty_cols(evenBetter)

glimpse(evenBetter)
Observations: 209
Variables: 5
$ sepal_length <dbl> 5.80000000, 5.80000000, 0.01874617, 6.50000000, 6...
$ sepal_width  <dbl> 2.6, 2.7, NA, 3.0, 2.9, NA, 3.8, 2.8, NA, 3.0, 3....
$ petal_length <dbl> 4.0, 5.1, NA, 5.5, 5.6, NA, 1.9, 4.8, NA, 4.2, 5....
$ petal_width  <dbl> 1.2, 1.9, NA, 1.8, 1.8, NA, 0.4, 1.8, NA, 1.5, 2....
$ species      <chr> "versicolor", "virginica", NA, "virginica", "virg...
  • Genuinely missing values should be retained, but in this case the NA's were artificially added.

  • Removing any rows with NA

evenBetterBetter = remove_missing(evenBetter) ## remove_missing() is from ggplot2
cleanIris = evenBetterBetter
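An aside not used in these slides: tidyr::drop_na() does the same job, dropping every row that contains at least one NA.

```r
library(tidyr)

## drop_na() removes every row that contains at least one NA
## (toy made-up data):
df = data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
drop_na(df)   ## keeps only the first row
```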
  • The two datasets now agree in size:
glimpse(cleanIris)
Observations: 150
Variables: 5
$ sepal_length <dbl> 5.8, 5.8, 6.5, 6.3, 5.1, 6.2, 5.9, 6.8, 5.0, 6.7,...
$ sepal_width  <dbl> 2.6, 2.7, 3.0, 2.9, 3.8, 2.8, 3.0, 3.2, 3.0, 3.0,...
$ petal_length <dbl> 4.0, 5.1, 5.5, 5.6, 1.9, 4.8, 4.2, 5.9, 1.6, 5.0,...
$ petal_width  <dbl> 1.2, 1.9, 1.8, 1.8, 0.4, 1.8, 1.5, 2.3, 0.2, 1.7,...
$ species      <chr> "versicolor", "virginica", "virginica", "virginic...
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
$ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

S4: Clean coding (skim through)

Coding complexity increases with the number of brackets

  • The "inside-out" structure of nested function calls isn't great for human reading.
mean(cleanIris$sepal_length)
[1] 5.843333
plot(density(cleanIris$sepal_length), col = "red", lwd = 2)

Piping: read code from left to right

  • We introduce a new notation: " x %>% f " means "f(x)". We call this operation "x pipe f".

  • Compounded operations are possible. The RStudio keyboard shortcut is Cmd+Shift+M.

cleanIris$sepal_length %>% mean
[1] 5.843333
cleanIris$sepal_length %>%
  density %>%
  plot(col = "red", lwd = 2)

S5: dplyr: data subsetting master

Traditional way of subsetting data in R (1)

  • If I want the first two rows of column sepal_length and sepal_width in the cleanIris data:
## Assuming you know the position of column names.
## But what if you resample your data?
cleanIris[1:2, c(1, 2)]

## Assuming you know the position of column names.
## Also assuming the first two columns satisfy certain properties.
cleanIris[1:2, c(T, T, F, F, F)]

## Much better!
## What if you can't type out all the column names
## due to the size of your data?
cleanIris[1:2, c("sepal_length", "sepal_width")]

Traditional way of subsetting data in R (2)

  • Something more realistic: we want to extract rows based on some compounded criteria and select columns based on special keywords.
cleanIris[(cleanIris[,"sepal_length"] < 5) &
            (cleanIris[,"sepal_width"] < 3), c("petal_length", "sepal_length")]
# A tibble: 4 x 2
  petal_length sepal_length
         <dbl>        <dbl>
1          1.3          4.5
2          1.4          4.4
3          3.3          4.9
4          4.5          4.9
  • (Optional) A pro R user might know about the subset function, but it suffers from the same readability problems once the subsetting criteria become complex.
subset(cleanIris,
       subset = (sepal_length < 5) & (sepal_width < 3),
       select = grep("length", colnames(cleanIris), value = TRUE))
# A tibble: 4 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          4.5          1.3
2          4.4          1.4
3          4.9          3.3
4          4.9          4.5

Subsetting data using dplyr

  • Think of subsetting rows and columns as two separate procedures:
  • selecting columns is an operation on variables, and
  • filtering rows is an operation on observations

  • See dplyr cheatsheet.

library(dplyr)

cleanIris %>%
  filter(sepal_length < 5,
         sepal_width < 3) %>%
  select(contains("length"))
# A tibble: 4 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          4.5          1.3
2          4.4          1.4
3          4.9          3.3
4          4.9          4.5

dplyr does much more!

  • mutate creates new columns
iris_mutated = mutate(cleanIris,
      newVar = sepal_length + petal_length,
      newNewVar = newVar*2)

iris_mutated
# A tibble: 150 x 7
  sepal_length sepal_width petal_length petal_width    species newVar
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>  <dbl>
1          5.8         2.6          4.0         1.2 versicolor    9.8
2          5.8         2.7          5.1         1.9  virginica   10.9
3          6.5         3.0          5.5         1.8  virginica   12.0
4          6.3         2.9          5.6         1.8  virginica   11.9
5          5.1         3.8          1.9         0.4     setosa    7.0
# ... with 145 more rows, and 1 more variables: newNewVar <dbl>
  • group_by + summarise will create summary statistics for grouped variables
bySpecies = cleanIris %>%
  group_by(species)

bySpecies
# A tibble: 150 x 5
# Groups:   species [3]
  sepal_length sepal_width petal_length petal_width    species
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>
1          5.8         2.6          4.0         1.2 versicolor
2          5.8         2.7          5.1         1.9  virginica
3          6.5         3.0          5.5         1.8  virginica
4          6.3         2.9          5.6         1.8  virginica
5          5.1         3.8          1.9         0.4     setosa
# ... with 145 more rows
bySpecies %>%
  summarise(meanSepalLength = mean(sepal_length))
# A tibble: 3 x 2
     species meanSepalLength
       <chr>           <dbl>
1     setosa           5.006
2 versicolor           5.936
3  virginica           6.588
  • left_join for merging data
flowers = data.frame(species = c("setosa", "versicolor", "virginica"),
                     comments = c("meh", "kinda_okay", "love_it!"))

## cleanIris has the priority in this join operation
iris_comments = left_join(cleanIris, flowers, by = "species")

## Randomly sampling 6 rows 
sample_n(iris_comments, 6) 
# A tibble: 6 x 6
  sepal_length sepal_width petal_length petal_width    species   comments
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>     <fctr>
1          5.8         2.6          4.0         1.2 versicolor kinda_okay
2          6.4         2.8          5.6         2.2  virginica   love_it!
3          6.1         3.0          4.6         1.4 versicolor kinda_okay
4          4.9         2.4          3.3         1.0 versicolor kinda_okay
5          5.5         3.5          1.3         0.2     setosa        meh
6          6.0         2.2          4.0         1.0 versicolor kinda_okay

Checking if we cleaned the data properly

  • arrange for ordering rows
arrangeCleanIris = cleanIris %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)

## The true iris data
arrangeIris = iris %>% 
  clean_names() %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)
  • We sorted both the processed dirtyIris data and the original iris data the same way, so the two can be compared.
## The `Species` column is character or factor
all.equal(arrangeCleanIris, arrangeIris) 
[1] "Incompatible type for column `species`: x character, y factor"
arrangeIris = arrangeIris %>% 
  mutate(species = as.character(species)) 

## Great! 
all.equal(arrangeCleanIris, arrangeIris)
[1] TRUE
## identical is a stronger version of all.equal
## arrangeCleanIris is a tibble
## but arrangeIris is a data.frame
identical(arrangeCleanIris, arrangeIris) 
[1] FALSE
class(arrangeCleanIris)
[1] "tbl_df"     "tbl"        "data.frame"
class(arrangeIris)
[1] "data.frame"
## Now, they are exactly the same!
identical(arrangeCleanIris, arrangeIris %>% as.tibble)
[1] TRUE

dplyr special select functions (advanced)

  • select columns only if a certain string is present in the name
cleanIris %>% 
  select(ends_with("length"))
# A tibble: 150 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          5.8          4.0
2          5.8          5.1
3          6.5          5.5
4          6.3          5.6
5          5.1          1.9
# ... with 145 more rows
cleanIris %>% 
  select(starts_with("sepal"))
# A tibble: 150 x 2
  sepal_length sepal_width
         <dbl>       <dbl>
1          5.8         2.6
2          5.8         2.7
3          6.5         3.0
4          6.3         2.9
5          5.1         3.8
# ... with 145 more rows
  • summarise columns only if they satisfy a certain condition
bySpecies %>%
  summarise_if(is.numeric,
               funs(m = mean))
# A tibble: 3 x 5
     species sepal_length_m sepal_width_m petal_length_m petal_width_m
       <chr>          <dbl>         <dbl>          <dbl>         <dbl>
1     setosa          5.006         3.428          1.462         0.246
2 versicolor          5.936         2.770          4.260         1.326
3  virginica          6.588         2.974          5.552         2.026

S6: ggplot2: the best visualisation package

ggplot2 tutorial sheet

  • If you managed to install all packages successfully, you should be able to run the following to get an interactive tutorial sheet.
library(learnr3914)
learnggplot2()
  • Otherwise, please download and compile "ggplot2_basic_tutorial.Rmd" from here

  • If all else fails, please send me an email at kevin.wang@sydney.edu.au.

ggplot2: the philosophy

  • Di Cook: the real reason you should use ggplot2 is that its design forces you to use a certain grammar when producing a plot. So much so that every plot you produce is actually a statistic.
library(ggplot2)
p1 =
  iris %>%
  ggplot(aes(x = Sepal.Length,
             y = Sepal.Width,
             colour = Species)) +
  geom_point(size = 3)

p1 
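As a small taste of the grammar beyond the tutorial sheet, layers and facets compose with +:

```r
library(ggplot2)

## Adding a facet layer splits the same scatter plot by Species.
p2 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 3) +
  facet_wrap(~ Species)

p2
```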

S7: Conclusion

tidy data, coding, modelling and reporting

  • tidyverse is a collection of 20+ packages built on a shared design philosophy, organised to make data analysis and collaboration easier.

  • These packages:
    • integrate well with each other,
    • are NOT ad-hoc programming solutions, and
    • will always throw errors at you if you don't have a thorough understanding of your data.
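The cleaning workflow of this lecture can be recapped as a single pipe. Below is a sketch on a tiny made-up stand-in for dirtyIris, assuming a recent janitor where remove_empty() replaces the remove_empty_rows()/remove_empty_cols() functions used earlier:

```r
library(janitor)   ## clean_names(), remove_empty()
library(ggplot2)   ## remove_missing()
library(dplyr)     ## the pipe and arrange()

## A tiny made-up stand-in for dirtyIris:
dirty = data.frame(
  "Sepal.Length(*&^" = c(5.1, NA, 4.9),
  "SPECIES^"         = c("setosa", NA, "setosa"),
  allEmpty           = c(NA, NA, NA),
  check.names = FALSE
)

clean = dirty %>%
  remove_empty(c("rows", "cols")) %>%   ## drop the all-NA row and column
  clean_names() %>%                     ## sepal_length, species
  remove_missing(na.rm = TRUE) %>%      ## drop any remaining rows with NAs
  arrange(sepal_length)

clean
```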

A gallery (optional)

Advice in the future

  • Use RStudio + RMarkdown to document your code.

  • Learn some computational tools. They are not statistics, but not learning them could limit your career prospects.

  • Find "cool" components and adapt those into your work routine. (Hint: start with all RStudio cheatsheets and build up gradually.)

  • Take time to re-analyse an old dataset.

  • Learn the core functions of each package and read its vignettes.

  • Don't forget the theories and interpretations! This is a course about statistics after all, not Cranking-Out-Numbers-Less-Than-0.05-And-Reject-Null-Hypothesis-101.

Session Info and References

  • Dr. Garth Tarr
  • tidyverse.org
  • github.com/sfirke/janitor
  • gapminder.org
  • rstudio.com
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2    janitor_0.3.0   plotly_4.7.0    dplyr_0.7.2    
 [5] purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3   
 [9] ggplot2_2.2.1   tidyverse_1.1.1 knitr_1.16     

loaded via a namespace (and not attached):
 [1] reshape2_1.4.2    haven_1.1.0       lattice_0.20-35  
 [4] colorspace_1.3-2  htmltools_0.3.6   viridisLite_0.2.0
 [7] yaml_2.1.14       rlang_0.1.1       foreign_0.8-69   
[10] glue_1.1.1        modelr_0.1.0      readxl_1.0.0     
[13] bindr_0.1         plyr_1.8.4        stringr_1.2.0    
[16] munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0 
[19] rvest_0.3.2       htmlwidgets_0.9   psych_1.7.5      
[22] evaluate_0.10.1   labeling_0.3      forcats_0.2.0    
[25] httpuv_1.3.5      crosstalk_1.0.0   parallel_3.4.1   
[28] broom_0.4.2       Rcpp_0.12.12      xtable_1.8-2     
[31] scales_0.4.1.9002 backports_1.1.0   jsonlite_1.5     
[34] mime_0.5          mnormt_1.5-5      hms_0.3          
[37] digest_0.6.12     stringi_1.1.5     shiny_1.0.3      
[40] grid_3.4.1        rprojroot_1.2     tools_3.4.1      
[43] magrittr_1.5      lazyeval_0.2.0    pkgconfig_2.0.1  
[46] data.table_1.10.4 xml2_1.1.1        lubridate_1.6.0  
[49] assertthat_0.2.0  rmarkdown_1.6     httr_1.2.1       
[52] R6_2.2.2          nlme_3.1-131      compiler_3.4.1